# Conversational Search

In this notebook you will implement the following steps:

- **Answer selection + evaluation**: Implement a *search-based* conversation framework and evaluate it on conversation topics made up of conversation turns.
- **Answer ranking**: Implement a *re-ranking method* to sort the initial search results. Evaluate the re-ranked results.
- **Conversation context**: Implement a conversational context modeling method to keep track of the conversation state. 

Submission dates:
- **20 October**: retrieval + evaluation
- **20 November**: passage re-ranking
- **20 December**: conversation state tracking

## Test bed and conversation topics
The TREC CAsT corpus (http://www.treccast.ai/) for Conversational Search is indexed in this cluster and can be searched through an ElasticSearch API.

The queries and the relevance judgments are available through the `ConvSearchEvaluation` class.

# Google Colab Setup

The following steps are already implemented in the cell below. You need to download the starting project folder, upload it, adjust the paths, and finally run the notebook.


1.   Download the shared project folder as a zip;
2.   Unzip and re-upload to a folder of your own GDrive;
3.   Mount your GDrive on the Colab working environment;

Note: You will be asked to complete a Google Authorization procedure by following a link and pasting a code on the notebook.

4.   Copy the contents from the folder you uploaded to the Colab working dir;
5.   Add sys path locations to run aux Python scripts;
6.   Install dependencies.

After going through all these steps you should be able to run all the cells in the notebook.

In [1]:
# Colab Setup
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

# After downloading the shared starting point folder as a Zip
# Unzip it and re-upload it to a location on your GDrive

# This command copies the contents from the folder you uploaded to GDrive, to the colab working dir
!cp -r /content/drive/My\ Drive/faculdade/fct-miei/04_ano4_\(year4\)/semestre1/ri/ProjectoRI2020 /content

# Add working dir to the sys path, so that we can find the aux python files when running the Notebook
import sys
if '/content/ProjectoRI2020' not in sys.path:
  sys.path.append('/content/ProjectoRI2020')

# Finally install required dependencies to run the notebook
!pip install elasticsearch
!pip install bert-serving-client
!pip install transformers
!pip install spacy
!python -m spacy download en_core_web_sm
%tensorflow_version 2.x
!pip install t5==0.5.0
!rm -rf /content/t5-canard-v2
!cp -r /content/drive/My\ Drive/faculdade/fct-miei/04_ano4_\(year4\)/semestre1/ri/infos_projeto/t5-canard-v2.zip /content/t5-canard-v2.zip
!unzip /content/t5-canard-v2.zip

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Archive:  /content/t5-canard-v2.zip
   creating: t5-canard-v2/
  inflating: t5-canard-v2/saved_model.pb  
   creating: t5-canard-v2/variables/
  inflating: t5-canard-v2/variables/variables.index  
  inflating: t5-canard-v2/variables/variables.data-00000-of-00002  
  inflating: t5-canard-v2/variables/variables.data-00001-of-00002  


In [2]:
# common imports
import numpy as np
import pprint as pprint
import TRECCASTeval as trec
import ElasticSearchSimpleAPI as elastic
import project

In [3]:
# variables
infosDir = "/content/drive/My Drive/faculdade/fct-miei/04_ano4_(year4)/semestre1/ri/infos_projeto"

updateUtterances = False
updateElasticSearchResults = False
updateBERTResults = False
updatePhase01Metrics = False
updatePhase02Metrics = False
updatePhase03Metrics = False
updatePhase03Method01Metrics = False
updatePhase03Method02Metrics = False
updatePhase03Method03Metrics = False
updatePhase03MethodFinalMetrics = False
updatePlots = True

utterancesPath = {
  "elastic_search": infosDir + "/utterances" + "/normal.txt",
  "elastic_search_method1": infosDir + "/utterances" + "/utterance.txt",
  "elastic_search_method2": infosDir + "/utterances" + "/entities.txt",
  "elastic_search_method3": infosDir + "/utterances" + "/t5.txt",
  "elastic_search_methodFinal": infosDir + "/utterances" + "/final.txt"
}

# Reset the utterance files that are about to be regenerated.
# Opening a file in "w" mode already truncates it, so we only need to close it.
if updateUtterances and updatePhase01Metrics:
  open(utterancesPath["elastic_search"], "w").close()
if updateUtterances and updatePhase03Metrics:
  if updatePhase03Method01Metrics:
    open(utterancesPath["elastic_search_method1"], "w").close()
  if updatePhase03Method02Metrics:
    open(utterancesPath["elastic_search_method2"], "w").close()
  if updatePhase03Method03Metrics:
    open(utterancesPath["elastic_search_method3"], "w").close()
  if updatePhase03MethodFinalMetrics:
    open(utterancesPath["elastic_search_methodFinal"], "w").close()

phase01MetricsFileName = {
    "train": "phase01Train.npz",
    "test":  "phase01Test.npz"
}
phase02MetricsFileName = "phase02.npz"
phase03MetricsFileName = "phase03.npz"

relDocsPerTurn = 100
topicsIDs = {
    "train": (1, 2, 4, 7, 15, 17, 18, 22, 23, 24, 25, 27, 30),
    "test": (31, 32, 33, 34, 37, 40, 49, 50, 54, 56, 58, 59, 61, 67, 68, 69, 75, 77, 78, 79)
}

testBed = trec.ConvSearchEvaluation()
es = elastic.ESSimpleAPI()

In [4]:
# conversations
"""import TRECCASTeval as trec
import numpy as np

import ElasticSearchSimpleAPI as es
import numpy as np

import pprint as pprint

test_bed = trec.ConvSearchEvaluation()"""

print()
print("========================================== Training conversations =====")
topics = {}
for topic in testBed.train_topics:
    conv_id = topic['number']

    if conv_id not in topicsIDs["train"]:
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d'% (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance

print()
print("========================================== Test conversations =====")
for topic in testBed.test_topics:
    conv_id = topic['number']

    if conv_id not in topicsIDs["test"]:
        continue

    print()
    print(conv_id, "  ", topic['title'])

    for turn in topic['turn']:
        turn_id = turn['number']
        utterance = turn['raw_utterance']
        topic_turn_id = '%d_%d'% (conv_id, turn_id)
        
        print(topic_turn_id, utterance)
        topics[topic_turn_id] = utterance




1    Career choice for Nursing and Physician's Assistant
1_1 What is a physician's assistant?
1_2 What are the educational requirements required to become one?
1_3 What does it cost?
1_4 What's the average starting salary in the UK?
1_5 What about in the US?
1_6 What school subjects are needed to become a registered nurse?
1_7 What is the PA average salary vs an RN?
1_8 What the difference between a PA and a nurse practitioner?
1_9 Do NPs or PAs make more?
1_10 Is a PA above a NP?
1_11 What is the fastest way to become a NP?
1_12 How much longer does it take to become a doctor after being an NP?

2    Goat breeds
2_1 What are the main breeds of goat?
2_2 Tell me about boer goats.
2_3 What breed is good for meat?
2_4 Are angora goats good for it?
2_5 What about boer goats?
2_6 What are pygmies used for?
2_7 What is the best for fiber production?
2_8 How long do Angora goats live?
2_9 Can you milk them?
2_10 How many can you have per acre?
2_11 Are they profitable?

4    The Neolithic 

Search example:

In [5]:
"""elastic = es.ESSimpleAPI()
results = elastic.search_body(topics['33_1'], numDocs = 10)
print(results)"""

"elastic = es.ESSimpleAPI()\nresults = elastic.search_body(topics['33_1'], numDocs = 10)\nprint(results)"

## Retrieval with the training conversations
The ElasticSearchSimpleAPI notebook illustrates how to use ElasticSearch. Use this API to retrieve the top 100 ranked passages for each conversation turn. 

To evaluate the results you should use the provided `ConvSearchEvaluation` class. Examine and discuss the recall metric results. In terms of metrics, discuss what should be your goals for each step of the project.
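
As a reference point for that discussion, the metrics plotted at the end of this notebook (precision, recall, average precision and nDCG@5) can be sketched in plain Python. This is only an illustration on made-up passage ids and relevance grades; the function names below are hypothetical and this is not the implementation inside `ConvSearchEvaluation`, which should remain the tool used for the reported results.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved passages that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant passage occurs."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k with linear gain; `grades` maps passage id -> graded relevance."""
    dcg = sum(grades.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (hypothetical ids and grades, not taken from the TREC CAsT judgments).
ranked = ["p3", "p1", "p7", "p2", "p9"]
grades = {"p1": 2, "p2": 1, "p5": 3}
relevant = {doc_id for doc_id, grade in grades.items() if grade > 0}

print("P@5   ", precision_at_k(ranked, relevant, 5))
print("R@5   ", recall_at_k(ranked, relevant, 5))
print("AP    ", average_precision(ranked, relevant))
print("nDCG@5", ndcg_at_k(ranked, grades, 5))
```

Recall is bounded by what the first retrieval step returns, so it is the natural metric to watch in this phase; re-ranking can only reorder those passages, which is why precision-oriented metrics such as nDCG@5 become the focus later.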

In [6]:
# variables
topics = testBed.train_topics
topicsToIgnore = topicsIDs["train"]
relevanceJudgments = testBed.relevance_judgments
thisSetName = "train"

phase01Metrics = {
    "train": None,
    "test": None
}
if updatePhase01Metrics:
  phase01Metrics["train"] = project.phase1(relDocsPerTurn, updateElasticSearchResults, updateUtterances, es, testBed, topics, relevanceJudgments, topicsToIgnore, thisSetName)
  project.saveMetrics(phase01MetricsFileName["train"], relDocsPerTurn, phase01Metrics["train"])
  project.savePlotLabels(phase01Metrics["train"])
else:
  phase01Metrics["train"] = project.loadMetrics(phase01MetricsFileName["train"], relDocsPerTurn)

## Retrieval with the test conversations

In [7]:
# variables
topics = testBed.test_topics
topicsToIgnore = topicsIDs["test"]
relevanceJudgments = testBed.test_relevance_judgments
thisSetName = "test"

if updatePhase01Metrics:
  phase01Metrics["test"] = project.phase1(relDocsPerTurn, updateElasticSearchResults, updateUtterances, es, testBed, topics, relevanceJudgments, topicsToIgnore, thisSetName)
  project.saveMetrics(phase01MetricsFileName["test"], relDocsPerTurn, phase01Metrics["test"])
  project.savePlotLabels(phase01Metrics["test"])
else:
  phase01Metrics["test"] = project.loadMetrics(phase01MetricsFileName["test"], relDocsPerTurn)

## Passage Re-ranking
The Passage Ranking notebook example illustrates how to use the BERT service to compute the similarity between sentences. Using the BERT service, implement a passage ranking method that re-ranks the results of the initial retrieval step.

To evaluate the results you should use the provided `ConvSearchEvaluation` class.
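
One simple way to turn sentence similarity into a re-ranker is to sort the retrieved passages by the cosine similarity between the query embedding and each passage embedding. The sketch below assumes the embeddings have already been computed (for example by the BERT service) and uses small made-up vectors; `rerank_by_cosine` is a hypothetical helper, and cosine similarity is an assumption here rather than the scoring used in `project.phase2`.

```python
import numpy as np


def rerank_by_cosine(query_vec, passage_vecs, passage_ids):
    """Return (passage_id, score) pairs sorted by decreasing cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    p = np.asarray(passage_vecs, dtype=float)
    # L2-normalize so that the dot product equals the cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-12)
    scores = p @ q
    # Stable sort keeps the initial retrieval order for tied scores.
    order = np.argsort(-scores, kind="stable")
    return [(passage_ids[i], float(scores[i])) for i in order]


# Toy embeddings (hypothetical 3-dimensional vectors, for illustration only).
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0],   # orthogonal to the query
            [1.0, 1.0, 0.0],   # partially aligned
            [2.0, 0.0, 0.0]]   # same direction as the query
ids = ["pA", "pB", "pC"]

for passage_id, score in rerank_by_cosine(query, passages, ids):
    print(passage_id, round(score, 3))
```

Because the re-ranker only reorders the passages returned by the first step, recall at the full cutoff stays the same; any gain shows up in rank-sensitive metrics such as AP and nDCG@5.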


In [8]:
# variable
topicsTrain = testBed.train_topics
topicsTest = testBed.test_topics
topicsToIgnoreTrain = topicsIDs["train"]
topicsToIgnoreTest = topicsIDs["test"]
relevanceJudgmentsTrain = testBed.relevance_judgments
relevanceJudgmentsTest = testBed.test_relevance_judgments
thisSetNameTrain = "train"
thisSetNameTest = "test"

if updatePhase02Metrics:
  phase02Metrics = project.phase2(relDocsPerTurn, updateElasticSearchResults, updateBERTResults, es, testBed, topicsTrain, topicsTest, relevanceJudgmentsTrain, relevanceJudgmentsTest, topicsToIgnoreTrain, topicsToIgnoreTest, thisSetNameTrain, thisSetNameTest)
  project.saveMetrics(phase02MetricsFileName, relDocsPerTurn, phase02Metrics)
else:
  phase02Metrics = project.loadMetrics(phase02MetricsFileName, relDocsPerTurn)

## Conversation Context Modeling

The Conversation State Tracking example illustrates how to use the 

To evaluate the results you should use the provided `ConvSearchEvaluation` class.
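
A minimal baseline for context modeling is to expand each turn with earlier utterances, so that follow-up questions such as "Are angora goats good for it?" carry the topic with them. The sketch below prepends the first utterance of the conversation and a small window of preceding turns to the current one. The helper name `contextualize` and the `window` parameter are hypothetical, and this is only one possible reading of a concatenation approach, not necessarily what `project.phase3` does.

```python
def contextualize(utterances, turn_index, window=1):
    """Build a context-aware query for the turn at `turn_index`.

    The query is the first utterance of the conversation, followed by up to
    `window` immediately preceding utterances, followed by the current one.
    """
    current = utterances[turn_index]
    if turn_index == 0:
        return current
    context = [utterances[0]]
    # Start at 1 so the first utterance is never included twice.
    start = max(1, turn_index - window)
    context.extend(utterances[start:turn_index])
    return " ".join(context + [current])


# First turns of training topic 2 ("Goat breeds"), as printed earlier in this notebook.
goat_turns = [
    "What are the main breeds of goat?",
    "Tell me about boer goats.",
    "What breed is good for meat?",
    "Are angora goats good for it?",
]

for i in range(len(goat_turns)):
    print(i + 1, "->", contextualize(goat_turns, i))
```

Concatenation is cheap but adds noise as the window grows; entity extraction or a learned rewriter (the setup cell installs spaCy and a T5 model) are alternatives that try to keep only the terms the current turn actually depends on.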


In [9]:
# variable
topicsTrain = testBed.train_topics
topicsTest = testBed.test_topics
topicsToIgnoreTrain = topicsIDs["train"]
topicsToIgnoreTest = topicsIDs["test"]
relevanceJudgmentsTrain = testBed.relevance_judgments
relevanceJudgmentsTest = testBed.test_relevance_judgments
thisSetNameTrain = "train"
thisSetNameTest = "test"

#if updatePhase03Metrics:
[method1Metrics, method2Metrics, method3Metrics, methodFinalMetrics] = project.phase3(updateUtterances, updatePhase03Metrics, updatePhase03Method01Metrics, updatePhase03Method02Metrics, updatePhase03Method03Metrics, updatePhase03MethodFinalMetrics, relDocsPerTurn, updateElasticSearchResults, updateBERTResults, es, testBed, topicsTrain, topicsTest, relevanceJudgmentsTrain, relevanceJudgmentsTest, topicsToIgnoreTrain, topicsToIgnoreTest, thisSetNameTrain, thisSetNameTest)
#project.saveMetrics(phase03MetricsFileName, relDocsPerTurn, phase03Metrics)
#else:
#phase03Metrics = project.loadMetrics(phase03MetricsFileName, relDocsPerTurn)

if updatePlots:
  APs = [
         phase01Metrics["test"]["aps"], phase02Metrics["aps"], 
         method1Metrics["aps"]["lmd"], method1Metrics["aps"]["bert"], 
         method2Metrics["aps"]["lmd"], method2Metrics["aps"]["bert"], 
         method3Metrics["aps"]["lmd"], method3Metrics["aps"]["bert"], 
         methodFinalMetrics["aps"]["lmd"], methodFinalMetrics["aps"]["bert"]
         ]
  nDCGs = [
           phase01Metrics["test"]["ndcg5s"], phase02Metrics["ndcg5s"], 
           method1Metrics["ndcg5s"]["lmd"], method1Metrics["ndcg5s"]["bert"], 
           method2Metrics["ndcg5s"]["lmd"], method2Metrics["ndcg5s"]["bert"], 
           method3Metrics["ndcg5s"]["lmd"], method3Metrics["ndcg5s"]["bert"], 
           methodFinalMetrics["ndcg5s"]["lmd"], methodFinalMetrics["ndcg5s"]["bert"]
         ]
  Recalls = [
             phase01Metrics["test"]["recalls"], phase02Metrics["recalls"], 
             method1Metrics["recalls"]["lmd"], method1Metrics["recalls"]["bert"], 
             method2Metrics["recalls"]["lmd"], method2Metrics["recalls"]["bert"], 
             method3Metrics["recalls"]["lmd"], method3Metrics["recalls"]["bert"], 
             methodFinalMetrics["recalls"]["lmd"], methodFinalMetrics["recalls"]["bert"]
             ]
  Precisions = [
                phase01Metrics["test"]["precisions"], phase02Metrics["precisions"], 
                method1Metrics["precisions"]["lmd"], method1Metrics["precisions"]["bert"], 
                method2Metrics["precisions"]["lmd"], method2Metrics["precisions"]["bert"], 
                method3Metrics["precisions"]["lmd"], method3Metrics["precisions"]["bert"], 
                methodFinalMetrics["precisions"]["lmd"], methodFinalMetrics["precisions"]["bert"]
                ]
  labels = project.loadPlotLabels()
  convNumbers = labels["convNumbers"]
  convNames = labels["convNames"]
  project.doPlots(relDocsPerTurn, "test", "", APs, nDCGs, Recalls, Precisions, ["LMD", "BERT", "Concatenate LMD", "Concatenate BERT", "Entities LMD", "Entities BERT", "T5 LMD", "T5 BERT", "T5 with Entities LMD", "T5 with Entities BERT"], convNumbers, convNames)
